Use XGrammar for structured output generation #118
base: main
Conversation
Pull request overview
Updates the on-device structured JSON generation path to use XGrammar-based token masking, replacing the prior schema-constrained generator flow in the MLX and Llama model implementations.
Changes:
- Add `swift-xgrammar` dependency and integrate `XGrammar` into the package target.
- Implement XGrammar-driven structured JSON generation loops for `MLXLanguageModel` and `LlamaLanguageModel`.
- Add new structured-generation error cases (`schemaEncodingFailed`, `grammarMismatch`) in both models.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Sources/AnyLanguageModel/Models/MLXLanguageModel.swift | Switch structured JSON generation to XGrammar matcher + token-bitmask sampling. |
| Sources/AnyLanguageModel/Models/LlamaLanguageModel.swift | Switch structured JSON generation to XGrammar matcher + token-bitmask sampling; make generation async. |
| Package.swift | Add swift-xgrammar dependency and link XGrammar product to the main target. |
| Package.resolved | Update resolved pins to include swift-xgrammar (and other pin changes). |
```swift
let token = try backend.sample(using: bitmask, applyMask: needsMask)
if backend.endTokens.contains(token) {
    break
}
guard matcher.accept(Int32(token)) else {
    throw LlamaLanguageModelError.grammarMismatch
}
```
Copilot AI · Feb 9, 2026
The generation loop breaks immediately when a sampled token is in endTokens, before checking/advancing the grammar matcher. If the model samples EOS/EOT before the grammar has fully terminated (or in a step where masking is not applied), this will return incomplete/invalid JSON. Consider only stopping on end tokens when matcher.isTerminated is already true (or accept the token into the matcher and verify termination) so early EOS doesn’t truncate the structured output.
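A minimal sketch of that ordering, using only the names already visible in this diff plus the `isTerminated` state mentioned above (treat its exact spelling as an assumption about the matcher API):

```swift
let token = try backend.sample(using: bitmask, applyMask: needsMask)
if backend.endTokens.contains(token) {
    // Only honor EOS/EOT once the grammar has fully terminated; otherwise the
    // structured output would be truncated mid-JSON.
    guard matcher.isTerminated else {
        throw LlamaLanguageModelError.grammarMismatch
    }
    break
}
guard matcher.accept(Int32(token)) else {
    throw LlamaLanguageModelError.grammarMismatch
}
```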
```swift
private func tokenizerInfo(
    for vocab: OpaquePointer,
    vocabSize: Int,
    stopTokens: Set<Int>
) throws -> TokenizerInfo {
    guard vocabSize > 0 else {
        throw LlamaLanguageModelError.contextInitializationFailed
    }

    var encodedVocab: [String] = []
    encodedVocab.reserveCapacity(vocabSize)
    for tokenId in 0 ..< vocabSize {
        let token = llama_token(tokenId)
        encodedVocab.append(tokenToText(vocab: vocab, token: token) ?? "")
    }

    let stopTokenIDs = stopTokens.map { Int32($0) }
    return try TokenizerInfo(
        encodedVocab: encodedVocab,
        encoding: .byteFallback,
        stopTokenIDs: stopTokenIDs,
        addPrefixSpace: false
    )
}
```
Copilot AI · Feb 9, 2026
Building TokenizerInfo reconstructs the entire encodedVocab (size = vocabSize) for every structured-generation request. This is expensive and deterministic for a given model/vocab; consider caching the computed TokenizerInfo (keyed by vocab pointer + vocabSize, or by model ID) to avoid repeatedly traversing the entire vocabulary.
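A rough sketch of such a cache, assuming one `TokenizerInfo` per loaded model is enough; the property names and the single-entry policy are hypothetical, and real code would need to consider thread safety:

```swift
// Hypothetical single-entry cache: the vocab is fixed for a loaded model, so the
// expensive full-vocab traversal only needs to run once.
private var tokenizerInfoCache: TokenizerInfo?
private var tokenizerInfoCacheVocabSize: Int = 0

private func cachedTokenizerInfo(
    for vocab: OpaquePointer,
    vocabSize: Int,
    stopTokens: Set<Int>
) throws -> TokenizerInfo {
    if let cached = tokenizerInfoCache, tokenizerInfoCacheVocabSize == vocabSize {
        return cached
    }
    let info = try tokenizerInfo(for: vocab, vocabSize: vocabSize, stopTokens: stopTokens)
    tokenizerInfoCache = info
    tokenizerInfoCacheVocabSize = vocabSize
    return info
}
```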
```swift
.product(name: "XGrammar", package: "swift-xgrammar"),
.product(
```
Copilot AI · Feb 9, 2026
XGrammar is only imported/used inside the MLX and Llama conditional compilation blocks, but the target depends on it unconditionally. This forces consumers to fetch/build swift-xgrammar even when neither trait is enabled, and can break traitless builds if that dependency has platform/toolchain constraints. Consider making the XGrammar product dependency conditional on the MLX and/or Llama traits (similar to MLXLLM / LlamaSwift).
Suggested change:

```diff
-.product(name: "XGrammar", package: "swift-xgrammar"),
-.product(
+.product(
+    name: "XGrammar",
+    package: "swift-xgrammar",
+    condition: .when(traits: ["MLX", "Llama"])
+),
+.product(
```
```diff
 let backend = try MLXTokenBackend(
     context: context,
     input: lmInput,
     parameters: generateParameters,
     maximumTokens: maxTokens,
     endTokens: []
 )

-var generator = try ConstrainedJSONGenerator(backend: backend, schema: schema)
-let json = try generator.generate()
+let jsonSchema = try jsonSchemaString(for: schema)
+let grammar = Grammar(jsonSchema: jsonSchema, formatting: .compact, strictMode: true)
+let tokenizerInfo = try tokenizerInfo(
+    for: context.tokenizer,
+    vocabSize: backend.vocabSize,
+    stopTokens: backend.endTokens
+)
+let matcher = try await grammar.matcher(
+    for: tokenizerInfo,
+    stopTokens: backend.endTokens.map { Int32($0) },
+    terminatesWithoutStopToken: true
+)
```
Copilot AI · Feb 9, 2026
Structured generation constructs MLXTokenBackend with endTokens: [], which disables the backend’s normal EOS/EOT end-token detection. With the new XGrammar path, stopTokens and the loop’s early-stop check both rely on backend.endTokens, so this ends up passing an empty stop-token set to the matcher and prevents clean termination via EOS. Consider passing endTokens: nil here (use the backend’s default end tokens) and plumbing those into the matcher / termination logic so generation can stop cleanly when the model emits EOS after the grammar is satisfied.
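A sketch of that change, mirroring the diff above; treating `endTokens: nil` as "fall back to the model's default end tokens" is an assumption about `MLXTokenBackend`'s initializer:

```swift
let backend = try MLXTokenBackend(
    context: context,
    input: lmInput,
    parameters: generateParameters,
    maximumTokens: maxTokens,
    endTokens: nil  // assumption: nil means "use the model's default EOS/EOT tokens"
)

// The matcher then receives a non-empty stop-token set, so generation can end
// cleanly when the model emits EOS after the grammar is satisfied.
let matcher = try await grammar.matcher(
    for: tokenizerInfo,
    stopTokens: backend.endTokens.map { Int32($0) },
    terminatesWithoutStopToken: true
)
```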
```swift
if applyMask {
    var allowedIndices: [UInt32] = []
    allowedIndices.reserveCapacity(vocabSize)
    for tokenId in 0 ..< vocabSize where bitmask.isTokenAllowed(tokenId) {
        allowedIndices.append(UInt32(tokenId))
    }
    guard !allowedIndices.isEmpty else {
        throw MLXLanguageModelError.grammarMismatch
    }
    let allowedArray = MLXArray(allowedIndices)
    let maskedLogits = full(logits.shape, values: -Float.infinity)
    maskedLogits[0..., allowedArray] = logits[0..., allowedArray]
    let sampledToken = sampler.sample(logits: maskedLogits)
    return sampledToken.item(Int.self)
}
```
Copilot AI · Feb 9, 2026
The masked sampling path is O(vocabSize) per generated token (scans every token ID, builds an indices array, and allocates a full maskedLogits tensor each step). For typical vocab sizes (50k–200k) this can be a major bottleneck for structured generation. Consider reusing buffers across steps and/or applying the mask more directly (e.g., materializing only the mask once per step without reserving vocabSize, or using a bitmask→indices iterator if XGrammar exposes one) to avoid repeated full-vocab scans and allocations.
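One possible shape of this, as a hedged sketch: keep a single vocab-sized buffer for the whole request and apply the mask additively, instead of building an index list and cloning the logits through fancy indexing on every step. The only bitmask API assumed is what the diff already shows (`isTokenAllowed(_:)`); a bitmask-to-indices iterator, if XGrammar exposes one, would avoid the per-step scan entirely.

```swift
// Allocated once per structured-generation request, reused every step.
var maskBuffer = [Float](repeating: 0, count: vocabSize)

// Inside the per-token sampling step:
if applyMask {
    var anyAllowed = false
    for tokenId in 0 ..< vocabSize {
        let allowed = bitmask.isTokenAllowed(tokenId)
        maskBuffer[tokenId] = allowed ? 0 : -.infinity
        anyAllowed = anyAllowed || allowed
    }
    guard anyAllowed else {
        throw MLXLanguageModelError.grammarMismatch
    }
    // Additive mask: disallowed logits become -inf without an index gather or a
    // fresh full-vocab tensor allocation each step.
    let maskedLogits = logits + MLXArray(maskBuffer)
    return sampler.sample(logits: maskedLogits).item(Int.self)
}
```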
Follow-up to #106